One of the first steps in the analysis of a new dataset, often as part of data cleaning, typically involves generation of high level summaries, such as: how many observations, attributes, which ones are predictors and which ones are (could be?) outcomes, what are their ranges, distributions, percentages of missing values, how strongly correlated are the predictors among themselves and with the outcome(s), etc. It is usually at this stage when we develop our initial intuition about the level of difficulty of the problem and of the challenges presented by this particular dataset and therefore form our first set of ideas as to how to approach it. There are many multivariate methods under unsupervised learning umbrella that are extremely useful in this setting (that will be introduced later in the course), but first things first, and here we will start by loading few datasets into R and exploring their attributes in the form of univariate summaries and bivariate plots and contingency tables (where applicable).
For this assignment we will use several datasets available from UCI machine learning repository that for convenience and as to not to depend on UCI ML repository availability have been also copied into this course website. Once you have downloaded them onto your computer, they can be loaded into R using function read.table with necessary options (of which most useful/relevant include: sep – defining field separator and header – instructing read.table to use fields in the first line as column headers). In principle, read.table can also use URL as a full path to the dataset, but here, to be able to work independently of network connection, we recommend that you download those datasets locally and provide read.table with appropriate paths to their local copies. The simplest thing is probably to copy them to the same directory where your .Rmd file is, in which case just the file name passed to read.table should suffice. As always, please remember, that help(read.table) (or, ?read.table as a shorthand) will tell you quite a bit about this function and its parameters.
For those datasets that do not have column names included in their data files, it is often convenient to assign them explicitly. Please note that for some of these datasets categorical variables are encoded in the form of integer values, that by default R will interpret as continuous variables while the behavior of many R functions depends on the type of the input variables.
The code excerpts and their output presented below illustrate some of these most basic steps as applied to one of the datasets available from UCI. The homework problems follow after that – they will require you to apply similar kind of approaches to generate high levels summaries of few other UCI datasets.
habDat <- read.table("haberman.data",sep=",")
colnames(habDat) <- c("age","year","nodes","surv")
summary(habDat$surv)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.265 2.000 2.000
habDat$surv <- c("yes","no")[habDat$surv]
summary(habDat$surv)
## Length Class Mode
## 306 character character
habDat$surv <- factor(habDat$surv)
summary(habDat$surv)
## no yes
## 81 225
The following two examples below show generation of xy-scatterplots of age and node count for the patients in this dataset with color indicating their survival past 5 years using basic plotting capabilities in R as well as those provided by the package ggplot2.
oldPar <- par(mfrow=c(1:2),ps=16)
for ( iSurv in sort(unique(habDat$surv)) ) {
plot(habDat[,c("age","nodes")],type="n",
main=paste("Survival:",iSurv))
iTmp <- (1:length(levels(habDat$surv)))[levels(habDat$surv)==iSurv]
points(habDat[habDat$surv==iSurv,c("age","nodes")],col=iTmp,pch=iTmp)
}
par(oldPar)
ggplot(habDat,aes(x=age,y=nodes,colour=surv,shape=surv)) +
geom_point() + facet_wrap(~surv)
It seems that higher number of nodes might be associated with lower probability of survival. One attempt to quantify this relationship might involve testing relationship between indicators of survival and count of nodes exceeding arbitrarily chosen cutoffs (zero or 75th percentile in the example below). There is also substantial degree of overplotting due to integer values of node count and year that might, for instance, make it less apparent that not all patients with zero nodes survive if all of them were plotted in the same panel.
habDat$nodes0 <- habDat$nodes==0
table(habDat[, c("surv","nodes0")])
## nodes0
## surv FALSE TRUE
## no 62 19
## yes 108 117
habDat$nodes75 <- habDat$nodes>=quantile(habDat$nodes,probs=0.75)
table(habDat[, c("surv","nodes75")])
## nodes75
## surv FALSE TRUE
## no 39 42
## yes 178 47
Please feel free to model your solutions after the examples shown above, while exercising necessary judgement as to which attributes are best represented as continuous and which ones – as categorical, etc. The descriptions of homework problems provide some guidance as to what is expected, but leave some of those choices up to you. Making such calls is an integral part of any data analysis project and we will be working on advancing this skill throughout this course.
Lastly – do ask questions! Piazza is the best for that
This dataset presents an example of classification problem (authentic vs. counterfeit bank notes) using continuous predictors derived from image processing. More details about underlying data can be found in corresponding dataset description at UCI ML website. To load data into R please use data file data_banknote_authentication.txt available at the course website as well as in UCI ML dataset repository.
Once the dataset in loaded into R, please name appropriately data set attributes, determine number of variables (explain which ones are predictors and which one is outcome) and observations in the dataset (R functions such as dim, nrow, ncol could be useful for this), generate summary of the data using summary function in R and generate pairwise XY-scatterplots of each pair of continuous predictors indicating outcome using colour and/or shape of the symbols (you may find it convenient to use pairs plotting function). Describe your observations and discuss which of the variables are more likely to be informative with respect to discriminating forged bank notes from genuine.
Please comment on whether given the data at hand such problem appears to be an easy or hard one to solve. Try to guess using your best intuition, what could be an error in discriminating forged banknotes from genuine in this dataset: 50%, 20%, 10%, 5%, 2%, less than that? Later in the course we will work with this dataset again to actually develop such a classifier at which point you will get quantitative answer to this question, but for now, just from visual inspection of the scatterplots above, what do you think such an error could be? There is no wrong answer at this point, just try your best and make a note of it, so you can go back to it several weeks later. Also, please discuss some of the tradeoffs that might be associated with that. For instance, should one expect the cost of calling genuine note counterfeit to be the same as making the opposite call (allowing fake bills as authentic). Consider also anticipated frequency of these cases in the practical settings and how it could potentially interact with these costs.
banknote.data <- read.table("data_banknote_authentication.txt", sep = ",")
names(banknote.data) <- c("Variance", "Skewness", "Curtosis", "Entropy", "Counterfeit")
paste("The number of variables is", ncol(banknote.data))
## [1] "The number of variables is 5"
## The outcome is Counterfeit and the other four variables are predictors.
paste("The number of observations is", nrow(banknote.data))
## [1] "The number of observations is 1372"
summary(banknote.data)
## Variance Skewness Curtosis Entropy
## Min. :-7.0421 Min. :-13.773 Min. :-5.2861 Min. :-8.5482
## 1st Qu.:-1.7730 1st Qu.: -1.708 1st Qu.:-1.5750 1st Qu.:-2.4135
## Median : 0.4962 Median : 2.320 Median : 0.6166 Median :-0.5867
## Mean : 0.4337 Mean : 1.922 Mean : 1.3976 Mean :-1.1917
## 3rd Qu.: 2.8215 3rd Qu.: 6.815 3rd Qu.: 3.1793 3rd Qu.: 0.3948
## Max. : 6.8248 Max. : 12.952 Max. :17.9274 Max. : 2.4495
## Counterfeit
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4446
## 3rd Qu.:1.0000
## Max. :1.0000
pairs(banknote.data[1:4], names(banknote.data)[1:4], col = c("green", "red")[banknote.data$Counterfeit + 1], main = "Relation of Banknote Image Measurements to Counterfeit Status")
In the above pairwise plots, green indicates an ordinary banknote. Red indicates a counterfeit banknote. As can be seen from the pairwise plots, variance and skewness seem to be related to whether or not a banknote is counterfeit, whereas curtosis and entropy are not. All pairwise plots containing either variance or skewness show a separation between the samples of ordinary and counterfeit banknotes, whereas the plots containing only the other two variables show no separation.
Variance and skewness seem to be mostly reliable for differentiating ordinary and counterfeit banknotes. But the problem is difficult to completely solve because the samples have overlap, and for observations with values that correspond to the overlap may be ambiguous. The error appears to be 5%. There is a trade off between accuracy and the importance of identifying counterfeit banknotes. On one hand, it is important to avoid fraudulent or improper transactions. Errors in such action could result in punishment for using counterfeit banknotes or the needless disposal of ordinary banknotes. If I were using a statistical model to determine whether my own banknotes were counterfeit, I would err on the side that they were because the consequences of disposal of ordinary money are only the value of the money, whereas the punishment related to using counterfeit money could be much greater. However, if it were someone else’s money that they were going to use in some way detrimental to me, I would err toward identifying it as real so that I would be prepared for the worst.
This dataset presents an example of a regression problem – predicting mollusc age from measurements that are easier to obtain. More details about underlying data can be found in corresponding dataset description. To load data into R please use file abalone.data available at the course website as well as at the UCI ML data repository.
Once the dataset in loaded into R, please name appropriately data set attributes, determine number of variables (explain which ones are predictors – categorical vs. continuous – and which one is the outcome) and observations in the dataset (R functions such as dim, nrow, ncol could be useful for this), generate summary of the data using summary function in R and generate pairwise XY-scatterplots of each pair of continuous attributes.
Describe your observations and discuss which attributes might be more useful for predicting snail age.
abalone.data <- read.table("abalone.data", sep = ",")
names(abalone.data) <- c("Sex","Length","Diameter","Height","Whole weight","Shucked weight", "Viscera weight","Shell weight","Rings")
head(abalone.data)
## Sex Length Diameter Height Whole weight Shucked weight Viscera weight
## 1 M 0.455 0.365 0.095 0.5140 0.2245 0.1010
## 2 M 0.350 0.265 0.090 0.2255 0.0995 0.0485
## 3 F 0.530 0.420 0.135 0.6770 0.2565 0.1415
## 4 M 0.440 0.365 0.125 0.5160 0.2155 0.1140
## 5 I 0.330 0.255 0.080 0.2050 0.0895 0.0395
## 6 I 0.425 0.300 0.095 0.3515 0.1410 0.0775
## Shell weight Rings
## 1 0.150 15
## 2 0.070 7
## 3 0.210 9
## 4 0.155 10
## 5 0.055 7
## 6 0.120 8
paste("The number of variables is", ncol(abalone.data))
## [1] "The number of variables is 9"
## The outcome is Rings and the other eight variables are predictors. Sex is categorical and the other predictors are continuous.
paste("The number of observations is", nrow(abalone.data))
## [1] "The number of observations is 4177"
summary(abalone.data)
## Sex Length Diameter Height
## F:1307 Min. :0.075 Min. :0.0550 Min. :0.0000
## I:1342 1st Qu.:0.450 1st Qu.:0.3500 1st Qu.:0.1150
## M:1528 Median :0.545 Median :0.4250 Median :0.1400
## Mean :0.524 Mean :0.4079 Mean :0.1395
## 3rd Qu.:0.615 3rd Qu.:0.4800 3rd Qu.:0.1650
## Max. :0.815 Max. :0.6500 Max. :1.1300
## Whole weight Shucked weight Viscera weight Shell weight
## Min. :0.0020 Min. :0.0010 Min. :0.0005 Min. :0.0015
## 1st Qu.:0.4415 1st Qu.:0.1860 1st Qu.:0.0935 1st Qu.:0.1300
## Median :0.7995 Median :0.3360 Median :0.1710 Median :0.2340
## Mean :0.8287 Mean :0.3594 Mean :0.1806 Mean :0.2388
## 3rd Qu.:1.1530 3rd Qu.:0.5020 3rd Qu.:0.2530 3rd Qu.:0.3290
## Max. :2.8255 Max. :1.4880 Max. :0.7600 Max. :1.0050
## Rings
## Min. : 1.000
## 1st Qu.: 8.000
## Median : 9.000
## Mean : 9.934
## 3rd Qu.:11.000
## Max. :29.000
for(predictor in names(abalone.data)[!(names(abalone.data) %in% "Rings")]) {
plot(abalone.data[[predictor]], abalone.data$Rings, xlab = predictor, ylab = "Rings")
}
Height and sex seem to be the best predictors of age as determined by number of rings, with a taller specimen likely to be older and a male or female specimen likely to be older than a specimen with gender “I.” The other variables are somewhat related but much poorer predictors.
For extra 5 points for some of the continuous and categorical predictors generate boxplots rendering relationship between them.